Training language models to follow instructions with human feedback

https://proceedings.neurips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html

We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. (Abstract)

We call the resulting models InstructGPT.

The InstructGPT paper

In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback.

outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters

Figure 2

Step 1: supervised fine-tuning (SFT)

Step 2: reward model (RM) training

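Have the resulting model generate outputs from an input x, and have humans rank them

The most preferred output (y_win) and the least preferred output (y_lose) <- this is the human preference data

Train the reward model

x, y_win, y_lose

Train so that the reward for (x, y_win) is higher than the reward for (x, y_lose)

A linear layer is often added on top of the LLM's final layer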

Equation (1) (from Section 3.5)
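For reference, the reward model loss as I recall it from Section 3.5 (worth checking against the paper). The paper writes y_w and y_l for what these notes call y_win and y_lose; K is the number of outputs ranked per prompt, \sigma is the sigmoid, and r_\theta is the scalar reward model output.

```latex
\operatorname{loss}(\theta) =
  -\frac{1}{\binom{K}{2}}
  \mathbb{E}_{(x,\, y_w,\, y_l) \sim D}
  \left[ \log \sigma\!\left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right]
```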

(Is it assuming that the preferences follow the Bradley-Terry model?)
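A minimal sketch of the reward model step above, assuming a Bradley-Terry style preference probability sigmoid(r_win - r_lose). The function names, the toy hidden state, and the weights are all hypothetical; a real reward model would take the hidden state from the LLM rather than from a hand-written list.

```python
import math

def linear_reward_head(hidden, weights, bias):
    """Map a final-layer hidden state (list of floats) to a scalar reward.

    Stands in for the linear layer placed on top of the LLM (hypothetical helper).
    """
    return sum(w * h for w, h in zip(weights, hidden)) + bias

def bradley_terry_prob(r_win, r_lose):
    """P(y_win is preferred over y_lose) = sigmoid(r_win - r_lose)."""
    return 1.0 / (1.0 + math.exp(-(r_win - r_lose)))

def pairwise_loss(r_win, r_lose):
    """-log sigmoid(r_win - r_lose), written in a numerically stable form."""
    d = r_win - r_lose
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

# Toy usage: the loss is small when the preferred output already scores higher.
loss_good = pairwise_loss(2.0, 0.0)
loss_bad = pairwise_loss(0.0, 2.0)
```

Minimizing this loss pushes the reward for (x, y_win) above the reward for (x, y_lose), which matches the training goal noted above.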

Step 3: reinforcement learning via proximal policy optimization (PPO) on this reward model

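Proximal policy optimization (PPO)

An LLM is used here as well

x and y are generated from the instruction-tuning data

Maximize the reward

Keep the outputs from drifting too far from those of the instruction-tuned (SFT) model

Because y is generated by sampling, the reward cannot be differentiated directly -> reinforcement learning (PPO)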

(Equation (2))
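For reference, the RL objective as I recall it from Section 3.5 (worth checking against the paper). \pi_\phi^{RL} is the policy being trained, \pi^{SFT} is the supervised model, \beta is the KL penalty coefficient, and \gamma is the pretraining loss coefficient; \gamma = 0 corresponds to plain PPO.

```latex
\operatorname{objective}(\phi) =
  \mathbb{E}_{(x, y) \sim D_{\pi_\phi^{RL}}}
  \left[ r_\theta(x, y)
         - \beta \log \frac{\pi_\phi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)} \right]
  + \gamma \, \mathbb{E}_{x \sim D_{\text{pretrain}}}
  \left[ \log \pi_\phi^{RL}(x) \right]
```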

Contextual bandit
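A minimal sketch of the per-episode reward in this bandit view, assuming one prompt x, one sampled response y, and one scalar reward that combines the reward model score with a penalty for drifting from the SFT model. The function name, the token log-probability inputs, and the value of beta are hypothetical placeholders, not values taken from the paper.

```python
def kl_penalized_reward(rm_score, token_logps_rl, token_logps_sft, beta):
    """Reward for one (x, y) episode: RM score minus beta * log(pi_RL(y|x) / pi_SFT(y|x)).

    token_logps_rl / token_logps_sft are the per-token log-probabilities of the
    same sampled response y under the RL policy and the SFT model.
    """
    logp_rl = sum(token_logps_rl)    # log pi_RL(y | x)
    logp_sft = sum(token_logps_sft)  # log pi_SFT(y | x)
    return rm_score - beta * (logp_rl - logp_sft)

# Toy usage: the policy assigns y a higher probability than the SFT model does,
# so part of the reward model score is taken away.
reward = kl_penalized_reward(
    rm_score=1.0,
    token_logps_rl=[-2.0, -3.0],
    token_logps_sft=[-4.0, -6.0],
    beta=0.1,
)
```

When the two models agree on y, the penalty is zero and the reward is just the reward model score.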

Human labelers apparently mark which output is better for every pair of model outputs (TODO: verify, then update the notes above)
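A minimal sketch of how a ranking of K outputs could be expanded into all pairwise comparisons, assuming a strict best-first ranking with no ties. The function name and the example outputs are hypothetical.

```python
from itertools import combinations

def ranking_to_pairs(ranked_outputs):
    """Turn a best-first ranking into (y_win, y_lose) pairs for every pair of outputs.

    combinations() keeps the input order, so the first element of each pair
    is always the one ranked higher.
    """
    return list(combinations(ranked_outputs, 2))

# Toy usage: K = 4 ranked outputs give K * (K - 1) / 2 = 6 comparisons.
pairs = ranking_to_pairs(["A", "B", "C", "D"])
```

Each pair then serves as one (x, y_win, y_lose) training example for the reward model.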

Two problems

Reward hacking (exploiting holes in the reward model)

Learning to summarize from human feedback Figure 5

Alignment Tax

Aligning language models to follow instructions

Also want to look at https://arxiv.org/abs/2309.10202 (TODO)

The objective that combines the two is PPO-ptx

We also experiment with mixing the pretraining gradients into the PPO gradients, in order to fix the performance regressions on public NLP datasets. We call these models “PPO-ptx.” (Equation (2) follows; Section 3.5)
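A minimal sketch of the mixing described in the quote, assuming the objective is estimated as a batch average of KL-penalized rewards plus gamma times a batch average of pretraining log-likelihoods. The function name and the numbers are hypothetical; the gradient mixing itself happens inside the PPO update and is not shown here.

```python
def ppo_ptx_objective(penalized_rewards, pretrain_logps, gamma):
    """Batch estimate of (RL term) + gamma * (pretraining log-likelihood term).

    penalized_rewards: KL-penalized rewards for sampled (x, y) episodes.
    pretrain_logps:    log-likelihoods of pretraining texts under the RL policy.
    gamma = 0 reduces this to the plain PPO objective.
    """
    rl_term = sum(penalized_rewards) / len(penalized_rewards)
    ptx_term = sum(pretrain_logps) / len(pretrain_logps)
    return rl_term + gamma * ptx_term

# Toy usage: compare plain PPO (gamma = 0) with a PPO-ptx style mix.
plain_ppo = ppo_ptx_objective([1.0, 3.0], [-2.0, -4.0], gamma=0.0)
ppo_ptx = ppo_ptx_objective([1.0, 3.0], [-2.0, -4.0], gamma=0.5)
```

The pretraining term rewards the policy for keeping a high likelihood on pretraining text, which is the mechanism aimed at the regressions on public NLP datasets.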

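The data collection was also carefully designed (Section 3.2, labelers?)

The HHH evaluation results are in Aligning language models to follow instructions

According to ichikara-instruction LLMのための日本語インストラクションデータの作成,

it has been reported that high-quality instructions play an important role in SFT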

Reproducing the unreleased RLHF implementation

The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization

In terms of lineage, this work follows on from Learning to summarize from human feedback (to be verified)